Huilin Zhou (hz2507)
Zhuo Li (zl2637)
Qing Xu (qx2178)
Zhaoyu Liu (zl2638)
Many of us will be looking for a job in the next few years. In New York City, which types of jobs pay higher salaries? If one of them is the ideal job for me, what skills should I have before applying for it? And where should I rent an apartment if I want to live close to that type of position? By looking at the NYC Jobs data set from 2013 to 2017, we hope to offer advice on job category, required skills, and salary to people who are looking for a job.
What questions are you trying to answer? How did these questions evolve over the course of the project? What new questions did you consider in the course of your analysis?
Source: https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t/data
Scrape: We downloaded the data set from the website; the original data set contains 3174 job postings. We selected the variables describing work location, job category, preferred skills, and full-time/part-time status. We wanted to use Google Maps to get the longitude and latitude of each work location. Because Google Maps limits geocoding to 2500 observations at a time, we kept the first 2500 observations from the data set.
Cleaning:
1. Merged job categories that describe the same kind of job under different names, which left 12 job categories.
2. Unified the salary unit to “Annual” and recalculated the salary range and average salary.
3. Used Google Maps to convert each work location to longitude and latitude.
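Step 2 of the cleaning list can be illustrated with a minimal sketch. The column names `salary_range_from`, `salary_range_to`, `salary_frequency`, and `salary_mean` appear in the code later in this report; the toy rows and the conversion factors (2080 paid hours and 260 paid days per year) are assumptions made for illustration, not necessarily the values we used.

```r
# Toy postings; the values are made up for illustration.
jobs <- data.frame(
  salary_range_from = c(50000, 20, 150),
  salary_range_to   = c(70000, 30, 250),
  salary_frequency  = c("Annual", "Hourly", "Daily"),
  stringsAsFactors  = FALSE
)

# ASSUMPTION: 40 hours x 52 weeks = 2080 hours, 5 days x 52 weeks = 260 days.
annual_factor <- c(Annual = 1, Hourly = 2080, Daily = 260)

annualize <- function(df, factors = annual_factor) {
  # Look up the multiplier for each row from its salary frequency.
  k <- unname(factors[df$salary_frequency])
  df$salary_range_from <- df$salary_range_from * k
  df$salary_range_to   <- df$salary_range_to * k
  df$salary_frequency  <- "Annual"
  # Recompute the derived columns on the annual scale.
  df$salary_mean  <- (df$salary_range_from + df$salary_range_to) / 2
  df$salary_range <- df$salary_range_to - df$salary_range_from
  df
}

jobs_annual <- annualize(jobs)
print(jobs_annual)
```

A named lookup vector keeps the conversion rule in one place, so changing an assumed factor does not require touching the function.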
## Warning: geocode failed with status ZERO_RESULTS, location = "Office for
## Exec Proj Manager"
## Warning: geocode failed with status ZERO_RESULTS, location = "CP Cap Plan-
## Technical Planning"
## Warning: geocode failed with status ZERO_RESULTS, location = "Capital
## Projects-VP"
## Warning: geocode failed with status ZERO_RESULTS, location = "REES - Adult
## Ed & Training"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "1 Centre
## St., N.Y."
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "80 Maiden
## Lane"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "421 East
## 26th Street NY NY"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "42-09
## 28th Street"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "Not Used"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "255
## Greenwich Street"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "90-37
## Parsons Blvd., Queens"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "120
## Broadway, New York, NY"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "470
## Vanderbilt Ave"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "59-17
## Junction Blvd Corona Ny"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "125 Worth
## Street, Nyc"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "55 Water
## St Ny Ny"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "560 Brook
## Avenue Bronx New Yor"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "110
## William St. N Y"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "280
## Broadway, 7th Floor, N.Y."
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "75-20
## Astoria Blvd"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "Flushing
## Meadow Pk Olmsted Ctr"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "32-20
## Northern Blvd, L.I.C. Ny"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "Prospect
## Pk 95 Ppw &5Th St"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "79Th St &
## Riverside Dr."
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "345 Adams
## St., Brooklyn"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "17
## Bristol Street Brooklyn Ny"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "24 West
## 61 Street"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "33 Beaver
## St, New York Ny"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "165
## Cadman Plaza East"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "420 East
## 38Th St."
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "100 Gold
## Street"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "28-11
## Queens Plaza No., L.I.C."
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "150
## William Street, New York N"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "66 John
## Street, New York, Ny"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "250
## Broadway, N.Y."
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "210
## Joralemon St., Brooklyn"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "350 St.
## Marks Pl., Staten Isla"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "1274
## Bedford Ave., Brooklyn"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "83 Maiden
## Lane, New York Ny"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "151-20
## Jamaica Avenue"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "100
## Church St., N.Y."
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "9
## Metrotech Center, Brooklyn N"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "30-30
## Thomson Ave L I City Qns"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location =
## "240-250-252 Livingston Street"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "155 West
## Broadway New York N Y"
## Warning: geocode failed with status OVER_QUERY_LIMIT, location = "350 Jay
## St, Brooklyn Ny"
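The warnings above show the same address being sent to the geocoder several times, and placeholder values such as "Not Used" being sent as well. One way to reduce the number of queries, sketched below, is to geocode each distinct usable address once and join the coordinates back to the postings. This is only an illustration: `fake_geocode` is a hypothetical offline stand-in that returns made-up coordinates, `geocode_unique` and the column name `work_location` are assumptions, and in practice a real geocoding function would be passed in instead.

```r
# Toy postings; the addresses are taken from the warning log above.
jobs <- data.frame(
  job_id = 1:6,
  work_location = c("80 Maiden Lane", "80 Maiden Lane", "Not Used",
                    "255 Greenwich Street", "1 Centre St., N.Y.",
                    "255 Greenwich Street"),
  stringsAsFactors = FALSE
)

# HYPOTHETICAL stand-in for a real geocoder so the sketch runs offline.
# The coordinates it returns are made up and are NOT real locations.
fake_geocode <- function(addresses) {
  data.frame(
    work_location = addresses,
    lon = -74 + seq_along(addresses) / 100,
    lat = 40.7 + seq_along(addresses) / 100,
    stringsAsFactors = FALSE
  )
}

geocode_unique <- function(df, geocoder, skip = c("Not Used", "")) {
  # Query each distinct, usable address only once.
  addresses <- setdiff(unique(df$work_location), skip)
  coords <- geocoder(addresses)
  # Left join back to the postings; skipped addresses get NA coordinates.
  df$.row <- seq_len(nrow(df))
  out <- merge(df, coords, by = "work_location", all.x = TRUE)
  out <- out[order(out$.row), ]
  out$.row <- NULL
  rownames(out) <- NULL
  out
}

jobs_geo <- geocode_unique(jobs, fake_geocode)
n_queries <- length(setdiff(unique(jobs$work_location), c("Not Used", "")))
print(jobs_geo)
```

In this toy example six postings need only three geocoding queries, and the "Not Used" row keeps missing coordinates instead of triggering a failed request.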
First, we use the longitude and latitude of the 2500 observations to plot their locations on a map, colored by job category and labeled with the average annual salary.
# Interactive map of work locations with plotly and Mapbox
library(tidyverse)
library(plotly)
Sys.setenv('MAPBOX_TOKEN' = 'pk.eyJ1IjoieHVxaW5nc2FsbHkiLCJhIjoiY2phZWh0djdyMHUzZTJ3bGR3MHFsdmIzZSJ9.vhYtu7zeAAuX6slhdDj6lA')
# Hover text showing the mean annual salary of each posting
p <- nyc_jobs %>%
  mutate(text_label = str_c("Annual mean salary: $", salary_mean))
plot_mapbox(p, lat = ~lat, lon = ~lon,
            size = 2,
            split = ~job_category,
            mode = 'scattermapbox') %>%
  add_markers(
    text = ~text_label,
    color = ~job_category, size = I(8)) %>%
  layout(title = 'Work Location',
         font = list(color = 'white'),
         plot_bgcolor = '#191A1A', paper_bgcolor = '#191A1A',
         mapbox = list(style = 'dark'),
         legend = list(orientation = 'h',
                       font = list(size = 8)),
         margin = list(l = 25, r = 25,
                       b = 25, t = 25,
                       pad = 2))
Although the data set covers job postings in NY, some work locations are plotted outside NY.
Next, we plot the distribution of average annual salary for each job category.
plot_ly(nyc_jobs, y = ~salary_mean, color = ~job_category, type = "box",
        colors = "Set2") %>%
  layout(title = "Annual average salary distribution in 12 job categories")
We found that Technology and Engineering jobs have high salaries overall, while Clerical jobs have the lowest. This makes sense: most technology and engineering work requires advanced skills and knowledge, which command higher pay.
# Keep full-time, annually paid postings with a stated category and requirements
job_data = nyc_jobs %>%
  select(job_category, salary_range_from, salary_range_to, minimum_qual_requirements, full_time_part_time_indicator, salary_frequency) %>%
  filter(job_category != " ", minimum_qual_requirements != " ", full_time_part_time_indicator == "F", salary_frequency == "Annual")
# Keywords that signal a bachelor's (x) or master's (y) degree requirement
x = c("baccalaureate", "Bachelor")
y = c("Master", "master")
# Three groups: master's keyword only, bachelor's keyword only, neither keyword
# (postings that mention both keywords fall into none of the groups)
master_data = filter(job_data, grepl(paste(y, collapse = "|"), minimum_qual_requirements), !grepl(paste(x, collapse = "|"), minimum_qual_requirements))
baccalaureate_data = filter(job_data, grepl(paste(x, collapse = "|"), minimum_qual_requirements), !grepl(paste(y, collapse = "|"), minimum_qual_requirements))
Other_data = filter(job_data, !grepl(paste(y, collapse = "|"), minimum_qual_requirements), !grepl(paste(x, collapse = "|"), minimum_qual_requirements))
# Box plots of base salary (lower end of the salary range) for each group
plot_ly(master_data, y = ~salary_range_from, color = ~job_category, type = "box", colors = "Set2") %>%
  layout(title = "Base salary of jobs requiring at least a master's degree")
plot_ly(baccalaureate_data, y = ~salary_range_from, color = ~job_category, type = "box", colors = "Set2") %>%
  layout(title = "Base salary of jobs requiring at least a baccalaureate degree (no master's needed)")
plot_ly(Other_data, y = ~salary_range_from, color = ~job_category, type = "box",
        colors = "Set2") %>%
  layout(title = "Base salary of jobs with no degree requirement")
# Width of the salary range (maximum minus minimum) for each posting
job_data = mutate(job_data, salary_range = salary_range_to - salary_range_from)
plot_ly(job_data, y = ~salary_range, x = ~job_category, type = "bar") %>%
  layout(title = "Salary ranges of different kinds of jobs")
job_positions = nyc_jobs %>%
  select(x_of_positions, agency, job_category, salary_mean) %>%
  distinct() %>%
  group_by(agency, job_category) %>%
  summarise(positions = sum(x_of_positions)) %>%
  arrange(desc(positions))
# Number of job positions: top 10
knitr::kable(head(job_positions, 10))
| agency | job_category | positions |
|---|---|---|
| DEPT OF HEALTH/MENTAL HYGIENE | Health | 382 |
| ADMIN FOR CHILDREN’S SVCS | Social service | 100 |
| DEPT OF ENVIRONMENT PROTECTION | Engineering | 82 |
| DEPARTMENT OF BUILDINGS | Public safety | 71 |
| DEPARTMENT OF PROBATION | Public safety | 65 |
| DEPARTMENT OF TRANSPORTATION | Engineering | 65 |
| DEPARTMENT OF TRANSPORTATION | Maintance | 60 |
| LAW DEPARTMENT | Legal Affairs | 56 |
| NYC HOUSING AUTHORITY | Maintance | 56 |
| DEPT OF HEALTH/MENTAL HYGIENE | Community | 55 |
The table above lists the ten agency and job category combinations with the most job positions in NYC from 2013 to 2017. DEPT OF HEALTH/MENTAL HYGIENE, in the “Health” category, has the most positions.
positions_plot = nyc_jobs %>%
  select(x_of_positions, agency, job_category, salary_mean, posting_date) %>%
  distinct() %>%
  separate(posting_date, into = c("year", "month", "day"), sep = "-") %>%
  select(-month, -day) %>%
  group_by(job_category, year) %>%
  summarise(positions = sum(x_of_positions))
plot_ly(positions_plot, x = ~job_category, y = ~positions,
        color = ~year, type = "bar") %>%
  layout(title = "Number of positions of job categories in each year",
         barmode = "group")
The bar chart above shows the number of positions for each job category in each year. From 2013 to 2017, the number of positions in each category keeps increasing, and every category shows a dramatic increase in 2017. One possible reason is that more companies were founded and grew in 2017 and therefore needed more employees. In addition, since this data set contains all job postings from the official NYC jobs website, the growth may also reflect more people finding the website and creating postings there over the years.
library(tidytext)
library(wordcloud2)
# Make sure the text columns are character vectors before tokenizing
nyc_jobs = nyc_jobs %>%
  ungroup() %>%
  mutate(minimum_qual_requirements = as.character(minimum_qual_requirements)) %>%
  mutate(preferred_skills = as.character(preferred_skills))
# Word counts for preferred skills: drop stop words, keep words listed in parts_of_speech
jobs_words_skill = nyc_jobs %>%
  unnest_tokens(word, preferred_skills) %>%
  anti_join(stop_words) %>%
  inner_join(., parts_of_speech) %>%
  count(word, sort = TRUE)
# Same word counts for the minimum qualification requirements
jobs_words_requirement = nyc_jobs %>%
  unnest_tokens(word, minimum_qual_requirements) %>%
  anti_join(stop_words) %>%
  inner_join(., parts_of_speech) %>%
  count(word, sort = TRUE)
# Bar charts of the 20 most frequent words
jobs_words_skill %>%
  top_n(20) %>%
  mutate(word = fct_reorder(word, n)) %>%
  plot_ly(y = ~word, x = ~n, color = ~word, type = "bar")
jobs_words_requirement %>%
  top_n(20) %>%
  mutate(word = fct_reorder(word, n)) %>%
  plot_ly(y = ~word, x = ~n, color = ~word, type = "bar")
# Word clouds of the same counts
set.seed(123)
wordcloud2(jobs_words_skill, size = 2, color = 'random-light',
           backgroundColor = "gray", fontWeight = 'bold',
           minRotation = -pi/3, maxRotation = pi/3, rotateRatio = 0.8)
wordcloud2(jobs_words_requirement, size = 2, color = 'random-light',
           backgroundColor = "gray", fontWeight = 'bold',
           minRotation = -pi/3, maxRotation = pi/3, rotateRatio = 0.8)
If you undertake formal statistical analyses, describe these in detail
What were your findings? Are they what you expect? What insights into the data can you make?